Business task for cyclistic case study

cyclistc is a bike sharing company based in chicago. Based on collected data stakeholders are interested in finding out how to convert casual riders into paying annual members. For easy conversion understanding behavours of each group on while using cyclistics is the one of the 3 focus of the team. which I was assinged to find out how casual riders differ from annual members in using thier services. which will be my business task for this project.

Stakeholders

1, cyclistic: bike sharing company 2, Cyclistic exeecutive team: approver of recommended marketing program. 3, Lily Moreno: director of marketig and manager 4, Cyclistic Marketing analytics team: source, collector and analysiers of data that helps guide cyclistic’s marketing strategy.

Data used for this analysis

Data used for this analysis was pre-collected for me by the cyclistic marketing analytics team. I have downloaded and stored the data approprately and created a dublicate of orgnal data before beging analysis.

Installing and Loading packages

install.packages("tidyverse", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/kiruk/AppData/Local/R/win-library/4.3'
## (as 'lib' is unspecified)
## package 'tidyverse' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\kiruk\AppData\Local\Temp\RtmpOMM2fS\downloaded_packages
installed.packages("lubridate", repos = "http://cran.us.r-project.org")
##      Package LibPath Version Priority Depends Imports LinkingTo Suggests
##      Enhances License License_is_FOSS License_restricts_use OS_type Archs
##      MD5sum NeedsCompilation Built
installed.packages("ggplot", repos = "http://cran.us.r-project.org")
##      Package LibPath Version Priority Depends Imports LinkingTo Suggests
##      Enhances License License_is_FOSS License_restricts_use OS_type Archs
##      MD5sum NeedsCompilation Built
installed.packages("plotly", repos = "http://cran.us.r-project.org")
##      Package LibPath Version Priority Depends Imports LinkingTo Suggests
##      Enhances License License_is_FOSS License_restricts_use OS_type Archs
##      MD5sum NeedsCompilation Built
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.1
## Warning: package 'lubridate' was built under R version 4.3.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(ggplot2)
library(skimr)
## Warning: package 'skimr' was built under R version 4.3.1
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.1
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
divvy_trip_01 <- read.csv(file.choose())
divvy_trip_02 <- read.csv(file.choose())
divvy_trip_03 <- read.csv(file.choose())
divvy_trip_04 <- read.csv(file.choose())
divvy_trip_05 <- read.csv(file.choose())
divvy_trip_06 <- read.csv(file.choose())
divvy_trip_07 <- read.csv(file.choose())
divvy_trip_08 <- read.csv(file.choose())
divvy_trip_09 <- read.csv(file.choose())
divvy_trip_10 <- read.csv(file.choose())
divvy_trip_11 <- read.csv(file.choose())
divvy_trip_12 <- read.csv(file.choose())

It is crucial to check all data frames have similar column names before merging.

trips<- bind_rows(divvy_trip_01,divvy_trip_02,divvy_trip_03, divvy_trip_04,divvy_trip_05,divvy_trip_06,divvy_trip_07,divvy_trip_08,divvy_trip_09,divvy_trip_10,divvy_trip_11, divvy_trip_12)

Using bind_rows function rows from each individual dataframes will merge into one large data frame called trips.

skim(trips)
Data summary
Name trips
Number of rows 5667717
Number of columns 13
_______________________
Column type frequency:
character 9
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1 16 16 0 5667717 0
rideable_type 0 1 11 13 0 3 0
started_at 0 1 19 19 0 4745862 0
ended_at 0 1 19 19 0 4758633 0
start_station_name 0 1 0 64 833064 1675 0
start_station_id 0 1 0 44 833064 1314 0
end_station_name 0 1 0 64 892742 1693 0
end_station_id 0 1 0 44 892742 1318 0
member_casual 0 1 6 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
start_lat 0 1 41.90 0.05 41.64 41.88 41.90 41.93 45.64 ▇▁▁▁▁
start_lng 0 1 -87.65 0.03 -87.84 -87.66 -87.64 -87.63 -73.80 ▇▁▁▁▁
end_lat 5858 1 41.90 0.07 0.00 41.88 41.90 41.93 42.37 ▁▁▁▁▇
end_lng 5858 1 -87.65 0.11 -88.14 -87.66 -87.64 -87.63 0.00 ▇▁▁▁▁

Data Cleaning

Removing Empty, NA, and Dublicates.

Merged data frame trips has empty colmun values in start_station_name, start_station_id, end_station_name, and end_station_id which are 833064,833064,892742, and 892742 respectivly. In addition to empty columns, n_missing are 5858 in both end_lat and end_lng which is affecting our complete_rate.

trips<- replace(trips, trips=='', NA)

Using replace function empty cells in the data frame will be assigned NA value. Doing so will make our data frame have 0 empty and all cells with NA Will be counted as n_missing instead.

trips<- trips[complete.cases(trips),]

Complete.cases will remove all values stored as NA thus giving us a data frame with no missing or empty values.

trips_02 <- trips %>%
  distinct() %>%
  unique()

since I am removing data I Will create a new version called trips_02. Code chunk will remove dublicates. It’s important to consult with stakeholders about missing values before removing as it can be filled to make it complete. But on this case missing values Will be removed.

Selecting columns that are specific to our analysis from the data frame.

trips_02<- trips_02 %>%
  select(ride_id, rideable_type, started_at, ended_at, start_station_name, start_station_id, end_station_name, end_station_id, member_casual)

The function Select will pick out selected columns out of trips_02 data frame to create a data frame that focus on stakeholders interest of answering user behavior difference. The removed columns dont aide in answering this question thus removed in the new version of trips data frame.

colnames(trips_02)
## [1] "ride_id"            "rideable_type"      "started_at"        
## [4] "ended_at"           "start_station_name" "start_station_id"  
## [7] "end_station_name"   "end_station_id"     "member_casual"

A check to see data frame has all selected columns.

trips_02$trip_duration <- difftime(trips_02$ended_at,trips_02$started_at, units = "secs")

creating a new column named trip_duration that is the difference between trip ending and trip beginning. This will answer the question of trip duration among user types.

str(trips_02)
## 'data.frame':    4369360 obs. of  10 variables:
##  $ ride_id           : chr  "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : chr  "2022-01-13 11:59:47" "2022-01-10 08:41:56" "2022-01-25 04:53:40" "2022-01-04 00:18:04" ...
##  $ ended_at          : chr  "2022-01-13 12:02:44" "2022-01-10 08:46:17" "2022-01-25 04:58:01" "2022-01-04 00:33:00" ...
##  $ start_station_name: chr  "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
##  $ start_station_id  : chr  "525" "525" "TA1306000016" "KA1504000151" ...
##  $ end_station_name  : chr  "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
##  $ end_station_id    : chr  "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
##  $ member_casual     : chr  "casual" "casual" "member" "casual" ...
##  $ trip_duration     : 'difftime' num  177 261 261 896 ...
##   ..- attr(*, "units")= chr "secs"
trips_02$trip_duration <- as.numeric(as.difftime(trips_02$trip_duration))

Converting trip_duration to numeric value from difftime. This change will enable calculations on trip_duration.

is.numeric(trips_02$trip_duration)
## [1] TRUE

checking our conversion was successful.

Creating New columns

trips_02$date <- as.Date(trips_02$started_at)
trips_02$month <- format(as.Date(trips_02$date),"%m")
trips_02$day <- format(as.Date(trips_02$date),"%d")
trips_02$year <- format(as.Date(trips_02$date), "%Y")
trips_02$day_of_week <- format(as.Date(trips_02$date),"%A")
trips_02$hour_minutes <- format(as.POSIXct(trips_02$started_at), "%H:%M")

The above code chunk created new columns that will further assist in answering stakeholders issues. All columns created are derived from the started_at column and they are as.Date format. By dividing started_at column to month,day,year, day_of_week, and hour_minute it enables an enhanced data analysis and visualization.

table(trips_02$rideable_type)
## 
##  classic_bike   docked_bike electric_bike 
##       2597426        174858       1597076
table(trips_02$member_casual)
## 
##  casual  member 
## 1758189 2611171
trips_03<- trips_02[!(trips_02$trip_duration <0 | trips_02$rideable_type == "docked_bike"),]

Table(trips_02$rideable_type) reveals that there are 3 types of rides and docked_bikes takes 4 % of the total rides taken out of the year. Its a minority of rides that are classified as docked_bikes. In addition arranging trip_duration from highest to lowest reveals that rides classified as docked_bikes go over a month in duration of rides which is significant compared to a normal bike ride. when removing docked_bikes ride duration drop to less than 24 hour. Correction procedures could include asking data collectors for correction, explanation or removal. For this analysis removal is the only option as the other options are unavailable. The above code chunk will remove all 174858 rides labeled as docked_bike and trips with negative values in the trip_duration column.

table(trips_03$rideable_type)
## 
##  classic_bike electric_bike 
##       2597398       1597035
colnames(trips_03)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "member_casual"     
## [10] "trip_duration"      "date"               "month"             
## [13] "day"                "year"               "day_of_week"       
## [16] "hour_minutes"
str(trips_03)
## 'data.frame':    4194433 obs. of  16 variables:
##  $ ride_id           : chr  "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : chr  "2022-01-13 11:59:47" "2022-01-10 08:41:56" "2022-01-25 04:53:40" "2022-01-04 00:18:04" ...
##  $ ended_at          : chr  "2022-01-13 12:02:44" "2022-01-10 08:46:17" "2022-01-25 04:58:01" "2022-01-04 00:33:00" ...
##  $ start_station_name: chr  "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
##  $ start_station_id  : chr  "525" "525" "TA1306000016" "KA1504000151" ...
##  $ end_station_name  : chr  "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
##  $ end_station_id    : chr  "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
##  $ member_casual     : chr  "casual" "casual" "member" "casual" ...
##  $ trip_duration     : num  177 261 261 896 362 ...
##  $ date              : Date, format: "2022-01-13" "2022-01-10" ...
##  $ month             : chr  "01" "01" "01" "01" ...
##  $ day               : chr  "13" "10" "25" "04" ...
##  $ year              : chr  "2022" "2022" "2022" "2022" ...
##  $ day_of_week       : chr  "Thursday" "Monday" "Tuesday" "Tuesday" ...
##  $ hour_minutes      : chr  "11:59" "08:41" "04:53" "00:18" ...
skim(trips_03)
Data summary
Name trips_03
Number of rows 4194433
Number of columns 16
_______________________
Column type frequency:
character 14
Date 1
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1 16 16 0 4194433 0
rideable_type 0 1 12 13 0 2 0
started_at 0 1 19 19 0 3653923 0
ended_at 0 1 19 19 0 3665869 0
start_station_name 0 1 7 64 0 1556 0
start_station_id 0 1 3 44 0 1270 0
end_station_name 0 1 10 64 0 1598 0
end_station_id 0 1 3 44 0 1281 0
member_casual 0 1 6 6 0 2 0
month 0 1 2 2 0 12 0
day 0 1 2 2 0 31 0
year 0 1 4 4 0 1 0
day_of_week 0 1 6 9 0 7 0
hour_minutes 0 1 5 5 0 1440 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2022-01-01 2022-12-31 2022-07-21 365

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
trip_duration 0 1 941.7 1601.01 0 355 616 1085 89965 ▇▁▁▁▁

A final check for data frame completeness and any issues to be cleaned before continuing on to the analysis phase.

Data Analysis

min(trips_03$trip_duration)
## [1] 0
max(trips_03$trip_duration)
## [1] 89965
mean(trips_03$trip_duration)
## [1] 941.7029
median(trips_03$trip_duration)
## [1] 616

Using min, max, mean and median will indicate our data boundary on the trips_duration, average and median data.

aggregate(trips_03$trip_duration~trips_03$member_casual, FUN = min)
##   trips_03$member_casual trips_03$trip_duration
## 1                 casual                      0
## 2                 member                      0

aggregating data to specific column in this case to member_casual shows what the minumum trip duration is among member types.

aggregate(trips_03$trip_duration~trips_03$member_casual, FUN = max)
##   trips_03$member_casual trips_03$trip_duration
## 1                 casual                  89965
## 2                 member                  89594

aggregating data to specific column in this case to member_casual shows what the maximum trip duration is among member types.

aggregate(trips_03$trip_duration~trips_03$member_casual, FUN = mean)
##   trips_03$member_casual trips_03$trip_duration
## 1                 casual              1262.5937
## 2                 member               747.1275

aggregating data to specific column in this case to member_casual shows what the average trip duration is among member types.

aggregate(trips_03$trip_duration~trips_03$member_casual, FUN = median)
##   trips_03$member_casual trips_03$trip_duration
## 1                 casual                    776
## 2                 member                    539

aggregating data to specific column in this case to member_casual shows what the median trip duration is among member types.

aggregate(trips_03$trip_duration~trips_03$member_casual+trips_03$day_of_week, FUN = mean)
##    trips_03$member_casual trips_03$day_of_week trips_03$trip_duration
## 1                  casual               Friday              1194.1929
## 2                  member               Friday               733.6164
## 3                  casual               Monday              1276.5298
## 4                  member               Monday               721.9727
## 5                  casual             Saturday              1415.5713
## 6                  member             Saturday               838.8530
## 7                  casual               Sunday              1434.4400
## 8                  member               Sunday               831.3002
## 9                  casual             Thursday              1128.0301
## 10                 member             Thursday               721.9113
## 11                 casual              Tuesday              1133.0574
## 12                 member              Tuesday               707.4686
## 13                 casual            Wednesday              1094.0147
## 14                 member            Wednesday               710.8143

aggregating data to specific column in this case to member_casual and day_of_week shows what the average trip duration is among member types in a specific day of the week. However, inthis case the days are out of order and a bit hard to read.

trips_03$day_of_week <- ordered(trips_03$day_of_week,levels=c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"))

using ordered function we can rearrange days of the week from sunday to saturday making it easy to read.

aggregate(trips_03$trip_duration~trips_03$member_casual + trips_03$rideable_type, FUN=mean)
##   trips_03$member_casual trips_03$rideable_type trips_03$trip_duration
## 1                 casual           classic_bike              1466.8318
## 2                 member           classic_bike               794.0642
## 3                 casual          electric_bike              1001.2371
## 4                 member          electric_bike               658.2664

aggregating data to specific column in this case to member_casual and rideable_type indicates what the average trip duration is among member types for each type of ride types in the data frame.

trips_04 <- trips_03 %>%
  mutate(weekday=wday(started_at, label=TRUE)) %>%
  group_by(weekday,member_casual) %>%
  summarise(number_of_rides = n(), average_duration = mean(trip_duration)) %>%
  arrange(member_casual,weekday)
## `summarise()` has grouped output by 'weekday'. You can override using the
## `.groups` argument.

This new dataframe consists of 4 columns with 14 rows which arranges the average trip duration for each rides along side membership type per each day of the week.

trips_05 <- trips_03 %>%
  mutate(weekday=wday(started_at, label=TRUE)) %>%
  group_by(member_casual,weekday,rideable_type) %>%
  summarise(number_of_rides = n(), average_duration = mean(trip_duration)) %>%
  arrange(member_casual,weekday)
## `summarise()` has grouped output by 'member_casual', 'weekday'. You can
## override using the `.groups` argument.

This new data frame consists of 5 columns with 28 rows. This data frame is identical to the previous data frame but this one has one more variable which is rideable_type. which arranges the average trip duration for each rides along side membership type per each day of the week by ride type.

trips_06 <- trips_03 %>%
mutate(weekday=wday(started_at,label = TRUE)) %>%
  group_by(member_casual,rideable_type) %>%
  summarise(number_of_rides = n(),
  average_duration = mean(trip_duration)) %>%
  arrange(member_casual,rideable_type)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
trips_06$number_of_rides <- as.numeric(as.character(trips_06$number_of_rides))
trips_06$average_duration <- as.numeric(as.character(trips_06$average_duration))
class(trips_06$number_of_rides)
## [1] "numeric"
class(trips_06$average_duration)
## [1] "numeric"

This new data frame consists of 4 columns and 4 rows. The data frame provides the total number of rides alongside average trip duration by membership type and ride type.

trips_07 <- trips_03 %>%
  select(ride_id,rideable_type) %>%
  group_by(rideable_type) %>%
  summarise(number_of_rides = n())

This new data frame has 2 columns and 2 rows. The data frame provides the total number of rides by ride type.

trips_08 <- trips_03 %>%
  select(ride_id,hour_minutes,member_casual) %>%
  group_by(hour_minutes,member_casual)%>%
  summarise(number_of_rides = n())%>%
  arrange(hour_minutes) 
## `summarise()` has grouped output by 'hour_minutes'. You can override using the
## `.groups` argument.

This data frame has 3 columns and 2880. It arranges the yearly ride totals by membership type and each minute of a day. This will help in visualizing peak ride hours in the day by membership type for the total annual rides taken in the data frame.

Visualization

x <- trips_07$number_of_rides

label <- trips_07$rideable_type

piepercent <- round(100*x/sum(x),1)

pie(x, labels = piepercent, main = "Percentage of rides by ride type", col = c("lightblue","orange"))
legend("bottomright", c("classic_bike","electric_bike"), cex = 1, fill = c("lightblue","orange"))

ggplot(trips_06)+
  geom_point(mapping = aes(x=member_casual, y=number_of_rides, color=rideable_type))+
  scale_y_continuous(breaks = seq(600000,2000000, by =100000))+
  xlab("Casual riders VS Members")+
  ylab("total number of rides taken")+
  labs(title="Casual riders VS members annual rideship by ride type")

ggplot(trips_06) +
  geom_point(mapping = aes(x=average_duration, y=number_of_rides, color = member_casual, shape = rideable_type)) +
  scale_y_continuous(breaks = seq(600000,2000000, by=100000))+
  xlab("Average duration of trips by second")+
  ylab("Total number of rides taken")+
  labs(title = "Average trip duration by ride type and membership")

ggplot(trips_04, mapping = aes(x=weekday,y=number_of_rides, fill=member_casual))+
  geom_col(position = "dodge")

ggplot(trips_05)+
  geom_point(mapping = aes(x=weekday, y=number_of_rides, color= member_casual, shape=rideable_type))

ggplot(trips_04, mapping = aes(x=weekday,y=average_duration, fill=member_casual))+
  geom_col(position = "dodge")+
  ylab("Average trip duration by second")

ggplot(trips_05)+
  geom_point(mapping = aes(x=weekday, y=average_duration, color= member_casual, shape=rideable_type))

plot2<- ggplot(trips_08, mapping = aes(x=hour_minutes, y=number_of_rides))+
  geom_col()+
  scale_x_discrete(breaks=c("12:00","06:00","18:00","00:00","23:59","23:59"))+
   theme(
        axis.title = element_text( color="red", size=10, face=2),
        axis.line = element_line(linewidth = 3, colour = "lightgreen", linetype=1),
        axis.text = element_text( angle = 90, color="blue", size=7, face=2)
        )+
  facet_wrap(~member_casual)+
  labs(title = "Number of rides for each hour of the day in a year.")
ggplotly(plot2)